Practical Suffix Tree Construction
نویسندگان
چکیده
Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm’s performance. To address this problem, we present a buffer management strategy for the O(n) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known diskbased construction algorithms.
منابع مشابه
ERA Revisited: Theoretical and Experimental Evaluation
Efficient construction of the suffix tree given an input text is an active area of research from the time it was first introduced. Both theoretical computer scientists and engineers tackled the problem. In this paper we focus on the fastest practical suffix tree construction algorithm to date, ERA. We first provide a theoretical analysis of the algorithm assuming the uniformly random text as an...
متن کاملDynamic extended suffix arrays
The suffix tree data structure has been intensively described, studied and used in the eighties and nineties, its linear-time construction counterbalancing his spaceconsuming requirements. An equivalent data structure, the suffix array, has been described by Manber and Myers in 1990. This space-economical structure has been neglected during more than a decade, its construction being too slow. S...
متن کاملOn – line construction of suffix trees 1
An on–line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It has always the suffix tree for the scanned part of the string ready. The method is developed as a linear–time version of a very simple algorithm for (quadrat...
متن کاملA Dynamic Approach to Weighted Suffix Tree Construction Algorithm
In present time weighted suffix tree is consider as a one of the most important existing data structure used for analyzing molecular weighted sequence. Although a static partitioning based parallel algorithm existed for the construction of weighted suffix tree, but for very long weighted DNA sequences it takes significant amount of time. However, in our implementation of dynamic partition based...
متن کامل2 Compact Suffix Arrays
The suffix array data structure that we present is due to Grossi and Vitter [1]. It uses a recursive construction that inflates the alphabet size, much like the the suffix array construction that we saw in Lecture 18. Building on this, we will construct a low-space suffix tree by augmenting this suffix array with an additional tree structure. This construction is due to Munro, Raman and Rao [2]...
متن کامل